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Abstract — Data inconsistencies are present in the data collected 
over a large wireless sensor network (WSN), usually deployed for 
any kind of monitoring applications. Before passing this data to 
some WSN applications for decision making, it is necessary to 
ensure that the data received are clean and accurate. In this 
paper, we have used a statistical tool to examine the past data to 
, lit in a highly sophisticated prediction model i.e., ARIMA for a 
given sensor node and with this, the model corrects the data using 
forecast value if any data anomaly exists there. Another scheme 
is also proposed for detecting data anomaly at sink among the 
aggregated data in the data are received from a particular sensor 
node. The effectiveness of our methods are validated by data 
collected over a real WSN application consisting of Crossbow 
IRIS Motes |1|. 
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I. Introduction 

Wireless Sensor Networks (WSNs) are formed by large 
number of autonomous units called sensor nodes. Each sensor 
node has the capability of sampling data, processing it and 
, sending the data through radio transmitters. In this aspect, 

■ each sensor node is independent of its sampling and sending 
mechanisms and its values. This independent working nature 
of sensor nodes set up notion of independent data transmission 

I to the base station. The aggregated data at sink are indepen- 
' dent. The base station is a processing center also called sink 
node or simple sink. 

WSNs are extensively used in natural environment monitor- 
ing and inventory management. Lots of specific applications 
have been developed to monitor very delegate processes that 
include: nuclear reactor control, habitat monitoring, object 
tracking, mines monitoring, fire detection, wild life monitor- 
ing, etc. Depending on the application and user requirement, 
sensor nodes report the data to the sink either in synchronous 
mode or in asynchronous mode. Usually sensor nodes sensed 
data in a fixed time indexed manner and transmit the data to 
the sink periodically. 

The WSNs based applications that we have mentioned 
above use aggregated data to perform a certain task and give 
meaningful outputs to the network or to the user The aggre- 
gated data from the WSN may be affected by anomalies in the 
WSN. The anomaly detection is possible when the aggregated 
data at sink do not follow a certain pattern [2\. Anomalous 
data patterns can be caused due to case 1: unreliability of 



wireless sensor networks or case 2: due to occurrence of 
unusual phenomena in the monitored region. For case 1: the 
unreUability of wireless sensor networks incurs faulty sensors 
and the faults occur due to hardware malfunction, sampling 
errors, transmission loss etc. Detecting data anomaly for both 
of the cases, case 1 and 2 are very important with respect to 
any type of monitoring applications. One important objective 
of WSN application is to detect the occurrence of unusual 
phenomena in the monitored region and to take necessary 
action for that. Another objective of WSN application is to 
make appropriate decisions based on aggregated data at sink 
in spite of the unreUability of the wireless sensor networks. 
Hence, it becomes crucial for us to correct the data before 
applying it to the applications. Otherwise, the anomalous data 
produced due to unreliability of wireless network will have a 
great impact on making appropriate decisions. 

In this paper we make an attempt to exploit behavior of a 
single node over a considerable time to correct data if there 
is any anomaly in the data. Specifically, we fit a statistical 
model to a single node as we know, all nodes transmit data 
independent of each others and it is quiet clear that we may 
not know the spatial information before hand. We validate our 
model with data gathered over a real WSN for considerable 
period of time based on the IRIS platform HI. 

A. Contributions 

In this paper we present an appropriate statistical modeling 
i.e., ARIM A{p, d, q) using the data of a real WSN application 
consisting of Crossbow IRIS Motes. We propose an algorithm 
[T] To find suitable ARIMA model and Forecast, which corrects 
the anomalous data at sink for each sensor node with ARIMA 
forecast values at any point of time. The forecast values are 
also used in the algorithm |2] Anomaly Detection for detecting 
anomalous data of a sensor node with 95% confidence interval. 
The algorithms applied for each node are solely dependent 
upon the data stream transmitted by that particular sensor 
node. As the algorithms use past data of individual node 
only, it is imperative that the algorithms do not depend upon 
state of other nodes in the network. We also do not consider 
contextual and temporal relationship among the nodes to 
predict the forecast value. While, if needed, contextual or 
temporal relationship can be used to further smooth our results 
as suggested by ||3J. 



The advantages of the proposed works are following com- 
pare to the earlier works. Our anomaly correction algorithm 
only needs data from the particular node we want to study. 
The proposed algorithm can be used for the purpose of fault 
tolerance in the following way. If few nodes fail to sense data 
due to transient fault at a particular instance of time, still 
we can produce data by processing its old data. Our method 
is highly sophistic method, ARIMA, in statistics time series 
models are known to represent many complex processes than 
any other models. Once the preliminary condition of stationary 
is satisfied then we can use them to represent complex series. 
Finally, all our data processing is to be done at the sink, which 
is suppose to have sufficient power and enough computational 
capability for fitting the statistics models, detecting and cor- 
recting the anomaly for the data of any sensor node. 

B. Related works 

Statistical modeling is used in literature for the purpose 
of data gathering with less number of transmission, anomaly 
detection in the gathered data at sink etc. When sampling of 
data is being done on regular time intervals, we get a time 
series. Time series is a well researched topic in statistical 
mathematics field. An interesting observation has been done 
in case of natural environmental data sampling by wireless 
sensor nodes, the time series is usually stationary in nature. 
This property is used for the purpose of choosing a suitable 
statistical modeling, related papers are explained below. 

A method is proposed by Liu et al. in [4J to reduce the 
transmission in the network. The method uses ARIMA model 
to construct a prediction model for sampled data. Specifically, 
in this method the model is being run on both sensor node and 
at the base station. If the difference between value sampled at 
sensor node and the value forecasted by ARIMA model is 
smaller than a pre-defined tolerance, the value is not transmit- 
ted over network to the base station. In this case base station 
is also running the same model, and hence use the forecasted 
value as actual value. This method has shown suppression 
of data transmission upto 78%. In this method time synchro- 
nization is an big issue when the clocks running at node are 
different from the base station or time delays in reporting. Jain 
and Chang in their paper fSl used recursive models to reduce 
transmission rate by setting the transmission rate adaptively. In 
this paper ||5l, the authors uses Kalman-Filter based estimation 
technique to give new sample rate when required. Adaptively 
adjusting the sampling rate reduces transmission frequency 
which result in energy saving. Anomalous behavior is common 
in WSNs. This fact puts forward a challenge to detect these 
anomalies with high efficiency. This paper |6|, Rajasegarar 
et al. have used Intel Berkeley Research Laboratory (IBRL) 
data to suggest an anomaly detection method using statistical 
concept of Mahanbolis Distance. The authors construct a 
hyperellipsoidal boundary using Mahanbolis distance, which 
provide the boundary for acceptance or rejection of data as 
good or anomaly. Ali et al. in the paper |3| directly deals 
with data cleaning for a set of aggregated data. The technique 
takes into account the contextual association among sensor 



nodes. The method is named as Time of Arrival Data Cleaning 
(TOAD). Depending upon belief range the authors proposed 
a scheme to select proper filter and try to minimize the 
difference between the actual value in the environment and 
the data received at the sink. Clustering can be used to club 
similar data points. In this paper [7|, clustering is done based 
on the measure of dissimilarity among data. The concept of 
Euclidean distance between pair is used to make clusters. 
The method uses Average Inter Cluster Distance (ICD) as the 
criteria for accepting or rejecting anomaly cluster. Another 
similar works proposed by Chitradevi et al. in |8| using the 
concepts of clustering. In this paper a scalable cluster based 
anomaly detection algorithm is proposed by the authors, where 
the algorithm locates anomalous clusters within sensor stream 
and enables the detection of both local and global anomalies. 

II. Basic Idea 

We assume sensor nodes are deployed over a region and 
formed a wireless sensor network. Each sensor node is sam- 
pling data periodically. In our experiment sensor nodes are 
sensing temperature and light data, and transmit data to the 
base station (sink) via multi-hop network. We use multi-hop 
wireless mesh network for our experiment. The aggregated 
data store in a database at the sink for the further processing. 
Before processing further we should identify all anomalous 
data. 

In the context of this paper we observe two types of data 
anomaly. The first type is irregular data which may generate 
due to occurrence or presence of unusual phenomena in the 
monitored region and the second type is erroneous data which 
may generate due to faulty sensor and/or unreliability of 
wireless communication during data transmission toward the 
sink. Considering the above two types of data, we define 
data anomaly which is given below. When data do not follow 
stationary time series then we consider that there is an anomaly 
of the data aggregated at the sink otherwise data are regular 
Both types of data are important for the evaluation purpose 
of a WSN application. In case of irregular data there might 
be a positive signal for occurrence of an unusual phenomena 
otherwise the data are erroneous. In case of erroneous data 
we have to replace or correct the data with an appropriate 
data which should follow stationary time series and remove 
the anomaly. 

The goals of this paper are detecting the anomaly and fixing 
anomaly within the data to each node by ARIMA forecast I9l. 

A. ARIMA Models 

The Auto-Regressive Integrated Moving Average (ARIMA) 
|9| models are a class of models for forecasting a time series. 
Unless a time series is stationary it is not possible to apply 
ARIMA models for forecasting. The property of stationary 
time series is that over time, statistical properties like mean, 
variance are constant. If the initial time series is non-stationary 
then by taking the differences between successive values it is 
possible to make the series stationary. In practice first order 
difference {Xl) is used to make a time series (Tt) stationary. 



where — Tt — Tt-i- The parametric representation of Tt is 
ARIMA(p,d,q). The time series Tt becomes stationary time 
series, Xf after d times differencing over itself for all t. The 
Xf is represented below. 

Xf = 01 X Xf_, + • • • + 0p X Xf^p + Yt + 9ix Yt-i + 

■■■+OgX Yt-q 

Where p is the number of auto-regressive (AR) terms and 
q is the number of moving average (MA) terms. With each 
AR term, there is an associated lagged dependent data sets, 
Xf_^, • • • , Xf_j, and for each MA term there is an associated 
random shocks, Yf , • • • , Yt-q respectively. Examination of the 
partial autocorrelation function (PACF) and the autocorrelation 
function (ACF) are required to get an idea of what orders {p) 
to be considered for the AR component and what orders (q) 
to be considered for the MA component respectively. For a 
given set of data funding appropriate values of p, d and q is 
the main task to fit an appropriate ARIMA model. After that 
the ARIMA{p, d, q) can be used for the entire data set. 

III. Fixing anomalies within particular node 

Objective in this section is to fix anomalies data within 
particular node. The basic idea of time series modeling is to 
make use of autocorrelation structure in the data sampled by 
the node in past. Our method works in three phases: namely 
Data sampling. Applying statistical tests and Forecasting val- 
ues with suitable model. 

Data sampling: In data sampling phase sensor nodes are 
working normally in supervision or sensed environmental 
parameters like light and temperature periodically and transmit 
the sensed data to the sink. Data are time indexed and stored 
at sink with equal time intervals. During this phase sensor 
nodes sample sufficient data to build the statistical model 
ARIMA{p, d, q) with appropriate values of p, d and q. As 
the amount of past data increases the model become more 
efficient in forecast. In this phase, there is no need to maintain 
any extra database at sink excepting the aggregated data. 

Applying statistical test: In this phase we start applying our 
statistical tests to check whether the data sampled in previous 
phase qualify for time series analysis or not. Examine ACF for 
stationarity. The ACF for a non-stationary series shows large 
autocorrelations that diminish only very slowly at large lags. 
If sampled data are non-stationary then differences and further 
ACF examination is required until the transformed time series 
become stationary. We collect the sampled data till the point of 
first anomaly reported in the data stream and apply stationary 
time series tests. Simple way to analyze the autocorrelation 
structure is to plot autocorrelation function (ACF) and partial 
autocorrelation function (R4CF) with respect to different lags. 

Forecasting with suitable model: The objective of finding 
suitable model is to determine the right order of the AR com- 
ponent and the MA component respectively. The traditional 
criteria for ARIMA model selection are Akaike Information 
Criterion and Schwarz Criterion 1 9 1. As we are working at sink 
it is assumed that we have enough computational power to 
run any criteria for model selection. For our purpose we have 
selected Akaike Information Criterion (AIC). Applying AIC 



we can find out AR component order (p) and MA component 
order (q). The d in ARIMA stands for the number of times the 
data have been differenced to render to stationary. And hence 
that will give us our suitable ARIMA model with p, d and q 
parameters. 

Once the model selection is over, we can use the model 
for forecasting and detecting and correcting anomaly with a 
forecast value. Usually five values are good to forecast at 
a single point, as more than five will result in accumulated 
statistical error. It is also possible that for more than five 
consecutive anomalies data for a sensor node, result some 
malicious activity or physical problems, so we restrict the 
forecasting to maximum five steps. To continue after five steps 
we update the data and again start the whole process. 

The following proposed Algorithm[T]can be used for finding 
appropriate value of the parameters p, d and q which fit 
an ARIMA{p, d, q) model, after that the model is used for 
forecasting future values and fixing anomaly. 



Algorithm 1 To find suitable ARIMA model and Forecast 
1: Find autocorrelation structure existence, e.g. by PACF and 

ACF or lag plot 
2: Fit a AR{p), order p calculated with AIC criterion 
3: Fit the residual of step 2 into MA{q) with order q 
4: Make the residual analysis 
5: Forecast future values and fix the anomaly 



The following proposed Algorithm |2] is used for detecting 
anomaly among the sensed data aggregated at sink for a 
particular sensor node. The same algorithm is applicable for 
all sensor nodes. 



Algorithm 2 Anomaly Detection 
1: Forecast future values with ARIMA model generated in 
Algorithm [T] 

2: Find the 95% confidence interval as /i ± 1.96cr where /i- 

forecast value, (T-standard error 
3: Test the null hypotheses: Reported value lies in between 

the above interval. 
4: Depending upon result of step 3, reject or accept reported 

value. 

5: The anomaly is detected in case of rejection in step 4. 



IV. Experimental results 

Battery powered Crossbow IRIS mote platform is used in 
experimental setup. A whole new setup of Xmesh network was 
explored to setup a self sustaining wireless network. Mote 
View software was used to monitor a network deployed in 
real environment with 15 motes. The motes were equipped 
with light sensors, temperature sensors, radio transceiver with 
ATMU 1281 microprocessor The motes are deployed in 
different locations considering the surrounding variation of 
temperature and light, such as inside of a room with AC and 
without AC, outside of the room, on a ladder to roof, near AC 



room compressor etc. Topology of the setup is shown in the 
figure [T] 

The data were collected simultaneously for around 4 hours 
from all motes, as we started our experiment at 4.27PM while 




Fig. 3. Figure showing a plot of PACF versus Lags for tlie data of group 1 



Fig. 1. Networlc Topology 

sunlight was high and temperature was also high. Then we 
stopped our experiment at 8.20 PM, meanwhile it was dark 
and temperature also dropped. The motes were set on high 
power state to aggregate the data at base station. The base 
station forwards the aggregated data to a Laptop. The high 
power state makes network rearrangements at every 36 seconds 
and data sampling at every 2 seconds. We collected data over 
network of 15 motes based on temporal and spatial variation. 
We performed our mathematical analysis based on the received 
data from mote 7 (mote id). The scatter diagram of the first 
6200 sampled data of mote 7 is showing in the figure |2l 
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Fig. 2. Figure showing a scatter plot of the data received from mote 7 

The scatter diagram indicates the presence of anomalous data, 
which are located isolated among the sampled data of mote 
7. The diagram also indicates a rapid drop of the data value 
at around 1000 sample, which is due to the transition time 
of day and night. Here we have separated the data set into 
two groups for analysis purpose. In group 1 we have taken 
sample from 1 to 1000 and in the group 2 we have taken 
sample from 1500 to 4000. First we are analyzing the data 
of Group 1 as follows. Sample data of group 1 is used to 
plot ACF verses lag and PACF verses lag. The ACF verses 
lag and PACF verses lag plots give us an idea how well the 
data fit in ARIMA model. The figure [3] shows that PACF is 
decreasing fast and the figure |4] shows that ACF is decreasing 



exponentially. It implies that the sample data belonging to the 




Fig. 4. Figure showing a plot of ACF versus Lags for the data of group 1 

group 1 are stationary. Therefore, the statistical test ensures 
that the sample data of group 1 are going to fit an ARIMA 
model. 
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Fig. 5. Figure showing a Lag Plot for the data received from mote 7 

Another simple way to analyze the autocorrelation structure 
of the data set is Lag Plot. A lag plot identify non-randomness 



of a data set or time series. Linear pattern of a lag plot ensures 
non-random data and further suggests that an autoregressive 
model might be appropriate [10|. The lag plot of sample data 
give us an idea how well the data fit in ARIMA model. The 
output of lag plot for the data of the group 1 is shown in the 
figure |5] and the straight line behavior ensures that the data are 
going to fit an ARIMA model. 

Now our next task is to find a suitable ARIMA{p, d, q) 
model i.e., the value of the parameters p, d, q for the group 
1 and group 2. As the data of the group 1 are stationary then 
d — 0, now we have to find p and q for respective the groups. 
The algorithm is implemented on SPLUS software pack. It 
gives the best model to be an ARIMA{2, 0, 30) with p = 2 
and q^30 often called ARMA{2, 30) for the data of group 
1. After doing similar statistical analysis we found that the 
data of the group 2 are also stationary i.e., d — and SPLUS 
software pack gives the best model ARIMA{14, 0, 33) which 
is same as ARM A{14:, 33) for the data of group 2 with p = lA 
and q = 33. 

Above ARIMA{2, 0,30) model, we have used for fore- 
casting and that forecast value can be used correcting data 
anomaly. The figure |6] shows a 25 steps forecast of data for 
group 1. In the figure|6] horizontal axis, i.e., a;— axis represents 
forecast steps and vertical axis, i.e., y— axis sample data. The 
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Fig. 6. Figure sliowing 25 step ARIMA forecast 
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Fig. 7. Figure sliowing 25 step standard error for ARIMA forecast 



corresponding error graph is shown in the figure|2]for verifying 
the notion that statistical error accumulates during forecasting. 
In this figure |7] axis represents the corresponding forecast 
steps and y— axis represents the standard error. It is clear from 
the figure|2]that the error increases with the number of forecast 
steps after the first 10 step of forecast. The figure [T] suggests 
that we can use forecast values which are less than the 10 step 
of forecast with insignificant error. 

The table H] shows the calculation for first 5 steps of forecast 
values. The actual values and the corresponding forecasted 
values are there in the table for comparison. It is clear from 
the table that the maximum percentage error we make during 
fixing anomaly is 3.62% only. As stated above 5 steps are well 
enough to support our idea of using the model to 5 steps only. 



TABLE I 

Table showing first 5 steps of forecast values and 
corresponding errors 



Forecast 
Step 


Upper 
Bound 


Lower 
bound 


Actual 
Value 


Forecast 
value 


Forecast 
Std.err 


Error 

% 


1 


369.32 


343.90 


363 


356.61 


6.48 


1.75 


2 


376.92 


347.29 


353 


362.10 


7.55 


2.58 


3 


376.44 


340.64 


346 


358.54 


9.13 


3.62 


4 


379.04 


336.67 


360 


357.85 


10.80 


0.59 


5 


384.79 


335.37 


364 


360.08 


12.60 


1.07 



The same method, which is proposed in Algorithm [T] can 
be used to set the limits for detecting anomalous data. We 
consider any data reported outside 95% confidence interval as 
anomalous data and reject the data according to the proposed 
Algorithm |2] The figure [S] shows the tighter upper and lower 
bound for detecting anomaly, where x— axis represents forecast 
steps and y— axis sample data. But, it is also clear from the 
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Fig. 8. Figure showing actual values, forecasted values and tighter upper 
and lower bounds for detecting anomaly 

same figure that the range of the bounds are increasing after 
the 5th forecast step. In case of rejection due to data anomaly, a 
forecast value is generated by the Algorithm[T]for replacement. 



From the calculated data shown in the table |T] it is clear [6] 
that the maximum error we are tolerating in rejecting correct 
value is 5%. This 5% is Type I error in our hypothesis testing 
stated in Algorithm |2l This 5% values gives rise to 95% 
confidence interval. Depending upon problem we can change 
the confidence interval to find anomalies data. 

As we change the confidence level for our experiment, the 
confidence interval get more narrow or broad. If we want 
tighter bounds then we decrease the confidence level. As we 
keep on decreasing the confidence level the size of interval 
gets smaller. But at the same time we will be loosing some 
good values as anomaly. We also propose that if some node 
is showing repeated anomaly then it is necessary to check the [10] 
node physically for any permanent fault that may have been 
occurred. 

V. Conclusion 

The ARIMA model has been widely used for data modeling 
and prediction. It has ability to capture a wide variety of 
realistic phenomena and is lightweight in terms of memory and 
computational cost. Its importance has not been till recognized 
by the research community of wireless sensor networks. The 
data received by sink is often corrupt, missed, or dirty. In 
order to clean the data produced by WSN, we developed a 
generalized framework using ARIMA model that identifies the 
degree of autocorrelation between past data. To the best of our 
knowledge, this work is the first to utilize the ARIMA model 
for finding anomalies within a stream of data for a single node 
and correct the anomalous data by appropriate forecast values. 
The method protects the application from the abnormalities in 
the data by incorporating aspects such as correlation and time 
of arrival of data. The novelty of this proposed method is that 
it provides a mechanism that informs the system about the best 
suited smoothing process to be used for correcting anomaly. 
We have validated our proposed method by using data from a 
real WSN application over the IRIS platform and demonstrated 
its ability to detect and correct the anomalous data with quite a 
good accuracy. Future research work in this direction includes 
the extension of our idea to detect node anomaly considering 
the spatial relationship among the sensor nodes. 
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